Data Cleaning in the Wild: Reusable Curation Idioms from a Multi-Year SQL Workload

Authors

  • Shrainik Jain
  • Bill Howe
Abstract

In this work-in-progress paper, we extract a set of curation idioms from a five-year corpus of hand-written SQL queries collected from a Database-as-a-Service platform called SQLShare. The idioms we discover in the corpus include structural manipulation tasks (e.g., vertical and horizontal recomposition), schema manipulation tasks (e.g., column renaming and reordering), and value manipulation tasks (e.g., manual type coercion, null standardization, and arithmetic transformations). These idioms suggest that users find SQL to be an appropriate language for certain data curation tasks, but we find that applying these idioms in practice is sufficiently awkward to motivate a set of new services to help automate cleaning and curation tasks. We present these idioms, the workload from which they were derived, and the features they motivate in SQL to help automate tasks. Looking ahead, we describe a generalized idiom recommendation service that can automatically apply appropriate transformations, including cleaning and curation, on data ingest.
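As a rough illustration of what the schema and value manipulation idioms named above can look like in SQL, here is a minimal sketch that combines column renaming, null standardization, and manual type coercion in a single view. It is not taken from the SQLShare corpus: the table `raw_upload`, its columns, the placeholder strings, and the view name `cleaned` are all hypothetical, and SQLite (via Python's standard `sqlite3` module) is used only as a convenient stand-in engine.

```python
import sqlite3

# Hypothetical raw table: every column was ingested as text, with
# assorted placeholder strings standing in for missing values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_upload (column1 TEXT, column2 TEXT)")
conn.executemany(
    "INSERT INTO raw_upload VALUES (?, ?)",
    [("s1", "12.5"), ("s2", "N/A"), ("s3", "-999"), ("s4", "7")],
)

# One view applies three of the idioms named in the abstract:
#   column renaming      -> column1 AS station, column2 AS temp_c
#   null standardization -> placeholder strings become NULL
#   manual type coercion -> CAST(... AS REAL)
conn.execute("""
    CREATE VIEW cleaned AS
    SELECT column1 AS station,
           CAST(CASE WHEN column2 IN ('N/A', '-999', '')
                     THEN NULL ELSE column2 END AS REAL) AS temp_c
    FROM raw_upload
""")

rows = conn.execute(
    "SELECT station, temp_c FROM cleaned ORDER BY station"
).fetchall()
for station, temp_c in rows:
    print(station, temp_c)
```

Even in this small case, the list of placeholder values and the target type must be spelled out by hand for each column, which is the kind of awkwardness the abstract cites as motivation for automating such idioms.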

Similar resources

Beyond “If then”-- Three Techniques for Cleaning Character Variables from Write-in Questions

In survey studies, cleaning answers to write-in questions can be difficult and time consuming, especially when the same response may be written in multiple ways. Misunderstanding of the survey question, unrecognizable handwriting, and negligence in data entry are major factors leading to data inaccuracies that are almost impossible to avoid. Writing a series of “if then” statements is a classic...

Study of the foundation, models and issues of research data curation and management in scientific and academic environments

Background and Aim: The purpose of this paper is to study, identify, and discuss the foundations and concepts, models and frameworks, and dimensions and challenges of research data curation and management in scientific and academic environments. Method: This is a review article; a library-based method was used to collect scientific and research texts in this field. In this research, external an...

Analyzing SQL Query Logs using Multi-Relational Graphs

Computer Science 6 (Data Management), FAU Erlangen-Nürnberg, {andreas.wahl|richard.lenz}@fau.de

Analytical SQL queries are a valuable source of information. They contain expert knowledge that cannot be inferred from schemas or content alone. Consider, for example, data lake scenarios, where relational and semi-structured data sources are combined in a single storage and processing environment. D...

Supporting the curation of biological databases with reusable text mining.

Curators of biological databases transfer knowledge from scientific publications, a laborious and expensive manual process. Machine learning algorithms can reduce the workload of curators by filtering relevant biomedical literature, though their widespread adoption will depend on the availability of intuitive tools that can be configured for a variety of tasks. We propose a new method for suppo...

Data Curation with Deep Learning [Vision]: Towards Self Driving Data Curation

Past. Data curation – the process of discovering, integrating, and cleaning data – is one of the oldest data management problems. Unfortunately, it is still the most time consuming and least enjoyable work of data scientists. So far, successful data curation stories are mainly ad-hoc solutions that are either domain-specific (for example, ETL rules) or task-specific (for example, entity resolut...

Publication date: 2016